Data Preprocessing and Transformation

Creating a Corpus

txt<-system.file('texts', 'txt', package='tm')

# Path to the texts/txt directory of the installed tm package

txt

"/opt/homebrew/lib/R/4.1/site-library/tm/texts/txt"

Use the DirSource, VectorSource, or DataframeSource function to create the source from which a Corpus is built.

ovid<-Corpus(DirSource(txt), readerControl=list(language='lat'))

ovid

Metadata: corpus specific: 1, document level (indexed): 0

Content: documents: 5

ovid[[1]]$content

[1] " Si quis in hoc artem populo non novit amandi,\n hoc legat et lecto carmine doctus amet.\n arte citae veloque rates remoque moventur,\n arte leves currus: arte regendus amor.\n\n curribus Automedon lentisque erat aptus habenis,\n Tiphys in Haemonia puppe magister erat:\n me Venus artificem tenero praefecit Amori;\n Tiphys et Automedon dicar Amoris ego.\n ille quidem ferus est et qui mihi saepe repugnet:\n\n sed puer est, aetas mollis et apta regi.\n Phillyrides puerum cithara perfecit Achillem,\n atque animos placida contudit arte feros.\n qui totiens socios, totiens exterruit hostes,\n creditur annosum pertimuisse senem."
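The other two sources named above can be sketched in the same way. This is a minimal illustration, assuming a recent tm release in which DataframeSource expects a data frame whose first two columns are named doc_id and text. The document texts and IDs are made up for the example.

```r
library(tm)

# VectorSource: each element of a character vector becomes one document
vec_corpus <- VCorpus(VectorSource(c("first document", "second document")))

# DataframeSource: one row per document (assumed columns: doc_id, text)
df <- data.frame(doc_id = c("doc_a", "doc_b"),
                 text = c("alpha text", "beta text"),
                 stringsAsFactors = FALSE)
df_corpus <- VCorpus(DataframeSource(df))

length(vec_corpus)        # number of documents in the corpus
content(df_corpus[[2]])   # text of the second document
```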

# List the readers that tm can use to load documents

tm::getReaders()

Convert bigdata.text, a character vector of tweets collected from Twitter, into a Corpus.

my.corpus<-Corpus(VectorSource(bigdata.text))

my.corpus

<<SimpleCorpus>>

Metadata: corpus specific: 1, document level (indexed): 0

Content: documents: 1000

my.corpus[[1]]$content

[1] "RT @mannny_fr: Day 99 of #100DaysOfCode \n\nTrying to get in the rhythm of learning this JavaScript! One More day!! #programming #CodeNewbie…"

# Use inspect() to view a subset of documents selected by index

inspect(my.corpus[1:2])

<<SimpleCorpus>>

Metadata: corpus specific: 1, document level (indexed): 0

Content: documents: 2

[1] RT @mannny_fr: Day 99 of #100DaysOfCode \n\nTrying to get in the rhythm of learning this JavaScript! One More day!! #programming #CodeNewbie…

[2] RT @Khulood_Almani: The #Metaverse➡️A Different Perspective \n\nhttps://t.co/7GG16At8m0\n\nv/@BBNTimes_en \n\n#AR #VR #NFTs #web3 #blockchain #Te…

tm_map()

The tm_map function applies a transformation to every document in a Corpus.

This is how preprocessing and other cleanup steps are applied to Corpus data.

# Transformations provided by tm

tm::getTransformations()

[1] "removeNumbers" "removePunctuation" "removeWords"

[4] "stemDocument" "stripWhitespace"

Applying an ordinary function other than the ones listed above directly to a Corpus can cause errors.

# Built-in transformation

my.corpus<-tm_map(my.corpus, stripWhitespace)

# A function that is not a built-in transformation

my.corpus<-tm_map(my.corpus, content_transformer(tolower))

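# The call above works correctly because tolower is wrapped in content_transformer()

my.corpus<-tm_map(my.corpus, tolower)

# If tolower() was applied directly as above, run the following command to convert the results back to text documents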

my.corpus<-tm_map(my.corpus, PlainTextDocument)
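The wrapped form can be sketched on a toy corpus. This is only an illustration with made-up documents, assuming a VCorpus: content_transformer() turns an ordinary character function into one that modifies the text of each document while leaving the document object itself in place.

```r
library(tm)

# Toy corpus; the texts are invented for this example
docs <- VCorpus(VectorSource(c("Hello WORLD", "Text MINING")))

# Wrap the base function so that it operates on the content of each document
lowered <- tm_map(docs, content_transformer(tolower))

# Each element is still a text document, only its content has changed
inherits(lowered[[1]], "PlainTextDocument")
content(lowered[[1]])
```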

Replacing Words

my.corpus<-tm_map(my.corpus, content_transformer(gsub), pattern="@\\S*", replacement='')

my.corpus<-tm_map(my.corpus, content_transformer(gsub), pattern="http\\S*", replacement='')
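To see what the two patterns match, they can be tried on a single string with base gsub(). The tweet below is a made-up example; the final whitespace cleanup is an extra step added here for readability and mirrors what stripWhitespace does on a Corpus.

```r
# Hypothetical tweet used only to illustrate the patterns
tweet <- "RT @user: check https://t.co/abc now"

# "@\\S*" matches "@" plus every following non-space character (here "@user:")
no_mentions <- gsub("@\\S*", "", tweet)

# "http\\S*" matches a URL up to the next space
no_urls <- gsub("http\\S*", "", no_mentions)

# Removing tokens leaves doubled spaces behind, so collapse them
cleaned <- gsub("\\s+", " ", no_urls)
cleaned
```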

Removing Punctuation

my.corpus<-tm_map(my.corpus, removePunctuation)

Removing Stopwords

my.corpus<-tm_map(my.corpus, removeWords, stopwords('en'))

# Add custom stopwords

my_stopwords<-c(stopwords('en'), 'rt', 'via', 'even')

my.corpus<-tm_map(my.corpus, removeWords, my_stopwords)
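One detail worth checking is letter case: the stopword list is lowercase and removeWords matches case-sensitively, so text is usually lowercased first. The sketch below applies removeWords to plain strings rather than a Corpus; the sentence and the custom_stopwords name are made up for the example.

```r
library(tm)

text <- "RT The data is big via Hadoop"
custom_stopwords <- c(stopwords("en"), "rt", "via")

# Without lowercasing, capitalised tokens such as "RT" and "The" survive
as_is <- removeWords(text, custom_stopwords)

# After lowercasing, they match the lowercase stopword list and are removed
lowered <- removeWords(tolower(text), custom_stopwords)

as_is
lowered
```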

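Natural Language Processing

In data mining, natural language processing generally includes morphological analysis.

For English text,

this means removing conjunctions, pronouns, and similar words, and then stemming, which groups together words that share a common stem.

tm does not provide stopwords for Korean, so another package has to be used alongside it.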

Stemming

tm provides the stemDocument and stemCompletion functions for stemming.

test<-stemDocument(c('updated', 'update', 'updating'))

test

[1] "updat" "updat" "updat"

test<-stemCompletion(test, dictionary=c('updated', 'update', 'updating'))

test

updat updat updat

"update" "update" "update"

As shown above, stemCompletion needs a dictionary to map each stem back to a complete word.

dict.corpus<-my.corpus # Keep a copy of the current corpus to use as the dictionary

my.corpus<-tm_map(my.corpus, stemDocument)

stemCompletion_mod<-function(x, dict){

PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x), " ")), dictionary=dict, type='first'), sep="", collapse=" ")))

}

my.corpus<-lapply(my.corpus, stemCompletion_mod, dict=dict.corpus)

my.corpus<-Corpus(VectorSource(my.corpus))

inspect(my.corpus[1:2])

TDM (Term-Document Matrix)

my.TDM<-TermDocumentMatrix(my.corpus)

TDM Restricted to a Dictionary of Terms

myDict<-c('bigdata', 'data', 'analyst', 'cloud', 'company', 'privacy', 'analytics', 'business', 'hadoop', 'datascience')

my.TDM<-TermDocumentMatrix(my.corpus, control=list(dictionary=myDict))

inspect(my.TDM[, 60:70])

<<TermDocumentMatrix (terms: 10, documents: 11)>>
Non-/sparse entries: 6/104
Sparsity           : 95%
Maximal term length: 11
Weighting          : term frequency (tf)
Sample             :
             Docs
Terms         60 61 62 63 64 65 66 67 68 69
  analyst      0  0  0  0  0  0  0  0  0  0
  analytics    2  0  0  0  0  0  0  0  0  0
  bigdata      1  0  0  0  0  1  0  0  0  0
  business     0  0  0  0  0  0  0  0  0  0
  cloud        0  0  0  0  0  0  0  0  0  0
  company      0  0  0  0  0  0  0  0  0  0
  data         0  0  0  0  0  0  0  0  0  0
  datascience  2  0  0  0  0  1  0  0  1  0
  hadoop       0  0  0  0  0  0  0  0  0  0
  privacy      0  0  0  0  0  0  0  0  0  0
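The effect of the dictionary option is easier to see on a corpus small enough to read by eye. The following is a sketch with three made-up documents, not the tweet data used above: the first matrix counts every term, the second only counts terms from the supplied dictionary.

```r
library(tm)

# Three invented documents
docs <- VCorpus(VectorSource(c("big data analytics",
                               "data privacy",
                               "cloud data")))

# All terms: rows are terms, columns are documents
tdm_all <- TermDocumentMatrix(docs)
m_all <- as.matrix(tdm_all)
rowSums(m_all)   # total frequency of each term across the documents

# Only terms listed in the dictionary are counted
tdm_dict <- TermDocumentMatrix(docs, control = list(dictionary = c("data", "hadoop")))
m_dict <- as.matrix(tdm_dict)
m_dict
```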

Analysis and Visualization

Association

tm::findAssocs

Use the findAssocs function to extract only the terms whose correlation with a given term is at or above a threshold.

findAssocs(my.TDM, 'warehouse', 0.5)

$warehouse
"20220119t150206z…           antifraud               booms
              0.77                0.77                0.77
          datas…",           insurtech               rages
              0.77                0.77                0.77
              blog                 cll    colearninglounge
              0.63                0.63                0.63
           comment             section                tier
              0.63                0.63                0.63
              link
              0.51
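The numbers reported by findAssocs are correlations between term occurrence vectors across documents. The idea can be sketched in base R on a tiny hand-made term-document matrix; the counts below are invented, and this is an illustration of the concept rather than the function's actual implementation.

```r
# Rows are terms, columns are documents (made-up counts)
m <- rbind(data      = c(1, 0, 1, 0),
           warehouse = c(1, 0, 1, 0),
           cloud     = c(0, 1, 0, 1))
colnames(m) <- paste0("doc", 1:4)

# cor() works column-wise, so transpose to correlate terms with each other
term_cor <- cor(t(m))

# Correlations of every other term with "warehouse"
assoc <- term_cor["warehouse", colnames(term_cor) != "warehouse"]

# Keep only the terms at or above the 0.5 limit
strong <- assoc[assoc >= 0.5]
strong
```

Here "data" occurs in exactly the same documents as "warehouse" and is kept, while "cloud" occurs only in the other documents and is dropped.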

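Converting to Transactions and Association Analysis (apriori)

library(arules)

# terms.m is not defined in this section; it is assumed to be a binary matrix with one row per document and one column per term

transaction_m<-as(terms.m, "transactions")

rules.all<-apriori(transaction_m, parameter=list(support=0.01, confidence=0.5))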

inspect(rules.all)
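The two thresholds passed to apriori can be worked out by hand for a single rule. The sketch below uses a made-up binary matrix of four documents and computes the support and confidence of the rule {data} => {cloud} in base R; it illustrates the definitions only and does not call arules.

```r
# Rows are documents (transactions), columns are terms (items); values are invented
m <- rbind(c(TRUE,  TRUE,  FALSE),
           c(TRUE,  TRUE,  TRUE),
           c(TRUE,  FALSE, FALSE),
           c(FALSE, FALSE, TRUE))
colnames(m) <- c("data", "cloud", "privacy")

# Support: share of documents that contain both "data" and "cloud"
support_rule <- mean(m[, "data"] & m[, "cloud"])

# Confidence: among documents containing "data", the share that also contain "cloud"
confidence_rule <- support_rule / mean(m[, "data"])

support_rule      # 2 of 4 documents
confidence_rule   # 2 of the 3 documents containing "data"
```

With the settings used above (support 0.01, confidence 0.5), this rule would pass both thresholds.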

Word Cloud

library(wordcloud)

my.TDM.m<-as.matrix(my.TDM)

term.freq<-sort(rowSums(my.TDM.m), decreasing=T)

wordcloud(words=names(term.freq), freq=term.freq, min.freq=15, random.order=F, colors=brewer.pal(8, 'Dark2'))

Sentiment Analysis

score.sentiment<-function(sentences, pos.words, neg.words, .progress='none'){

require(plyr)

require(stringr)

scores<-laply(sentences, function(sentence, pos.words, neg.words){

sentence<-gsub('[[:punct:]]', '', sentence)

sentence<-gsub('[[:cntrl:]]', '', sentence)

sentence<-gsub('\\d+', '', sentence)

sentence<-tolower(sentence)

word.list<-str_split(sentence, '\\s+')

words<-unlist(word.list)

pos.matches<-match(words, pos.words)

neg.matches<-match(words, neg.words)

# match() returns the position of the matched term or NA

pos.matches<-!is.na(pos.matches)

neg.matches<-!is.na(neg.matches)

score<-sum(pos.matches)-sum(neg.matches)

return(score)

}, pos.words, neg.words, .progress=.progress)

scores.df<-data.frame(score=scores, text=sentences)

return(scores.df)

}

Using the score.sentiment Function

sample<-c('Iloveyou', 'Ihateyou', 'What a wonderful day!', 'I hate you')

result<-score.sentiment(sample, pos.word, neg.word)

# pos.word and neg.word must be defined beforehand

result$score
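The scoring rule can be reproduced in base R to show what the word lists look like and why the samples score as they do. This is a simplified sketch, not the function above: the two word lists are tiny made-up stand-ins for a real sentiment lexicon, and score_one is a hypothetical helper that scores a single sentence.

```r
# Illustrative word lists; a real analysis would load a full lexicon
pos.word <- c("love", "wonderful")
neg.word <- c("hate")

# Score one sentence: number of positive words minus number of negative words
score_one <- function(sentence, pos.words, neg.words) {
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unlist(strsplit(cleaned, "\\s+"))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

sample_text <- c("Iloveyou", "Ihateyou", "What a wonderful day!", "I hate you")
scores <- vapply(sample_text, score_one, numeric(1),
                 pos.words = pos.word, neg.words = neg.word,
                 USE.NAMES = FALSE)
scores
```

The first two samples score 0 because they contain no whitespace: each is treated as a single word that appears in neither list. The last two score 1 and -1.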

tm(with TwitteR dataframe)